This project explores and analyses a dataset on red wine quality using R, a programming language for statistical computing. The dataset includes 1599 observations of 13 variables.
This is the summary of the dataset:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The output below shows the data type and the first few values of each variable.
The dataset consists of 1599 observations of 13 variables.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
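The two outputs above have the form produced by `summary()` and `str()`. As a minimal sketch, the same calls can be tried on a miniature data frame. The name `wine_head` is hypothetical, and it holds only the first ten rows of three columns, copied from the `str()` output above, so its summary will not match the full 1599-row dataset.

```r
# Hypothetical miniature of the dataset: the first ten rows of three
# columns, copied from the str() output above. The real data frame has
# 1599 rows and 13 columns.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, median, mean and max per column

dims <- dim(wine_head)  # number of rows, then number of columns
```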
This section shows a plot for each variable in the dataset. A description of the shape, center and spread of each plot (histogram) is given below it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed.acidity is right-skewed. The range of the data is 11.3. The main peak is at approximately 7. There is a small gap between 15 and 15.5. The middle 50% of observations fall between 7.10 and 9.20.
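The range and the quartile interval quoted for each variable follow directly from the summary output. A small sketch of that arithmetic for fixed.acidity, using only the numbers printed above (the variable names are illustrative):

```r
# Summary values of fixed.acidity, copied from the output above.
fixed_acidity <- c(min = 4.60, q1 = 7.10, median = 7.90, q3 = 9.20, max = 15.90)
mean_value <- 8.32

# Range = maximum - minimum; interquartile range = 3rd quartile - 1st quartile.
data_range <- unname(fixed_acidity["max"] - fixed_acidity["min"])  # 11.3
iqr        <- unname(fixed_acidity["q3"] - fixed_acidity["q1"])    # 2.1

# A mean above the median is one common sign of a right-skewed distribution.
right_skew_hint <- mean_value > unname(fixed_acidity["median"])
```

The same three lines apply to every other variable by swapping in its own summary values.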
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile.acidity is right-skewed with a short right tail. The range of the data is 1.46. There are two peaks, at approximately 0.4 and 0.6, so the plot can be described as bimodal. The middle 50% of observations fall between 0.39 and 0.64. There is a gap after approximately 1.3, so there are outliers at the right end of the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric.acid is right-skewed. The range of the data is 1. The peak is at 0, which means that zero is the most common citric acid value among the red wines, although the median is 0.260. The middle 50% of observations fall between 0.090 and 0.420. There are outliers at the right end of the plot.
# Log10-transform the x axis, because the untransformed histogram has a long right tail
ggplot(wineQualityReds, aes(x = residual.sugar)) + geom_histogram(binwidth = 0.03) + scale_x_log10()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual.sugar is right-skewed with a long right tail and some gaps. The range of the data is 14.6. The peak is at about 2. The middle 50% of observations fall between 1.900 and 2.600, and there are many small bars at the right end of the plot.
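Because ggplot2 applies scale transformations before computing statistics, `binwidth = 0.03` combined with `scale_x_log10()` is measured in log10 units, not in the original units of the variable. A rough sketch of what that implies for residual.sugar, using only the minimum and maximum reported above (the variable names are illustrative):

```r
# Minimum and maximum of residual.sugar, copied from the summary above.
sugar_min <- 0.9
sugar_max <- 15.5

# Width of the data on the log10 scale.
log_span <- log10(sugar_max) - log10(sugar_min)

# Approximate number of bins of width 0.03 (in log10 units) needed to
# cover that span.
bin_width <- 0.03
n_bins <- ceiling(log_span / bin_width)

# On the original scale each bin covers a constant ratio rather than a
# constant difference: its upper edge is about 7% above its lower edge.
bin_ratio <- 10^bin_width
```

So the bins are narrow near the peak around 2 and progressively wider towards 15.5, which is what compresses the long right tail in the transformed plot.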
# Log10-transform the x axis, because the untransformed histogram has a long right tail
ggplot(wineQualityReds, aes(x = chlorides)) + geom_histogram(binwidth = 0.03) + scale_x_log10()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides is right-skewed with a long right tail and some gaps, although it looks normal on the left side around the peak. The range of the data is 0.599. The median is 0.07900. The middle 50% of observations fall between 0.07000 and 0.09000, and there are many small bars at the right end of the plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free.sulfur.dioxide is right-skewed with a gap on the right around 60. The range of the data is 71. The median is 14. The middle 50% of observations fall between 7 and 21.
# Log10-transform the x axis, because the untransformed histogram has a long right tail
ggplot(wineQualityReds, aes(x = total.sulfur.dioxide)) + geom_histogram(binwidth = 0.03) + scale_x_log10()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The distribution of total.sulfur.dioxide is right-skewed with a short tail and a gap between approximately 170 and 280. The range of the data is 283. The median is 38. The middle 50% of observations fall between 22 and 62.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The distribution of density is almost symmetric. The range of the data is 0.0136. The median is 0.9968. The middle 50% of observations fall between 0.9956 and 0.9978.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The distribution of pH is almost symmetric. It could be bimodal because there are two peaks close together. The range of the data is 1.27. The median is 3.310. The middle 50% of observations fall between 3.210 and 3.400.
# Log10-transform the x axis, because the untransformed histogram has a long right tail
ggplot(wineQualityReds, aes(x = sulphates)) + geom_histogram(binwidth = 0.03) + scale_x_log10()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is right-skewed with a long tail and many gaps on the right, which suggests that there are some outliers. The range of the data is 1.67. The median is 0.6200. The middle 50% of observations fall between 0.5500 and 0.7300.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol is right-skewed with a gap on the right after 14. The peak is at 9.5. The range of the data is 6.5. The median is 10.20. The middle 50% of observations fall between 9.50 and 11.10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The range of the data is 5. The median is 6. Most red wines are of quality 5 or 6.
There are 1599 observations of 13 variables (the index variable is included even though it is not important to the analysis). Each observation represents one red wine sample.
The main feature of interest in this dataset is the quality.
I think all features in the dataset can support the investigation in one way or another. Some features may have a large effect and some a small one, but all can help make the analysis easier and more accurate.
No.
The dataset is already tidy, so there was no need to adjust it. The first variable, the index, is not important, but there was no need to remove it because keeping it does not harm the analysis. There were no unusual distributions, and I noticed that most of the variables are right-skewed. There are no left-skewed distributions at all.
This section shows the relationships between some variables in the dataset. A description of each plot is given below it.
## X fixed.acidity volatile.acidity citric.acid
## X 1.00 -0.27 -0.01 -0.15
## fixed.acidity -0.27 1.00 -0.26 0.67
## volatile.acidity -0.01 -0.26 1.00 -0.55
## citric.acid -0.15 0.67 -0.55 1.00
## residual.sugar -0.03 0.11 0.00 0.14
## chlorides -0.12 0.09 0.06 0.20
## free.sulfur.dioxide 0.09 -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.12 -0.11 0.08 0.04
## density -0.37 0.67 0.02 0.36
## pH 0.14 -0.68 0.23 -0.54
## sulphates -0.13 0.18 -0.26 0.31
## alcohol 0.25 -0.06 -0.20 0.11
## quality 0.07 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## X -0.03 -0.12 0.09
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## X -0.12 -0.37 0.14 -0.13 0.25
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## X 0.07
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
The strength of the relationship between two correlated variables is determined by looking at the correlation coefficient. A correlation of 0 means that no linear relationship exists between the two variables, whereas a correlation of 1 indicates a perfect positive relationship. It is uncommon to find a perfect positive relationship in the real world. If we find a positive correlation between two variables, it will most likely lie somewhere between 0 and 1.
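A matrix of this form can be obtained with `cor()` followed by `round()`. The sketch below assumes that approach and runs it on a hypothetical miniature data frame, `wine_head`, holding only the first ten rows of three columns from the `str()` output earlier, so its coefficients will differ from the full-data values in the matrix above.

```r
# Hypothetical miniature of the dataset: first ten rows of three columns,
# copied from the str() output. The real data frame has 1599 rows.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation between every pair of columns, rounded to two
# decimals like the matrix above.
cor_head <- round(cor(wine_head), 2)
print(cor_head)
```

Every coefficient lies between -1 and 1, the diagonal is always 1, and the matrix is symmetric, which is why each pair appears twice in the printed output.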
Observations from the correlation matrix and correlation plot above:
1- There is no relationship between volatile.acidity and residual.sugar.
2- There is a negligible relationship between many pairs of variables, for example fixed.acidity and residual.sugar.
3- All variables have a negligible to weak positive relationship with quality, except alcohol, which has the strongest positive relationship with quality.
4- residual.sugar has the weakest positive relationship with quality.
5- volatile.acidity has the strongest negative relationship with quality, but in general the negative relationships with quality are negligible to moderate.
6- free.sulfur.dioxide has the weakest negative relationship with quality.
Here, the relationships between some variables are visualized using scatter plots.
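The ranking in observations 3 to 6 can be checked mechanically. The sketch below copies the quality column of the correlation matrix above into a named vector (leaving out the index variable X and quality itself) and picks out the extremes; the variable names are illustrative.

```r
# Correlation of each variable with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity = 0.12, volatile.acidity = -0.39, citric.acid = 0.23,
  residual.sugar = 0.01, chlorides = -0.13, free.sulfur.dioxide = -0.05,
  total.sulfur.dioxide = -0.19, density = -0.17, pH = -0.06,
  sulphates = 0.25, alcohol = 0.48
)

positive <- quality_cor[quality_cor > 0]
negative <- quality_cor[quality_cor < 0]

strongest_positive <- names(which.max(positive))  # alcohol
weakest_positive   <- names(which.min(positive))  # residual.sugar
strongest_negative <- names(which.min(negative))  # volatile.acidity
weakest_negative   <- names(which.max(negative))  # free.sulfur.dioxide

# Full ranking by absolute strength, strongest first.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
```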
ggplot(wineQualityReds, aes(x=fixed.acidity, y=citric.acid)) + geom_point(alpha = 1/2)+
xlim(0.00,14)+
ylim(0.00,0.75)
There appears to be a fairly strong positive relationship between fixed.acidity and citric.acid (r = 0.67).
ggplot(wineQualityReds, aes(x=volatile.acidity, y=residual.sugar)) + geom_point(alpha = 1/2)+
xlim(0.0,1.3)+
ylim(0,12)
There appears to be no relationship between volatile.acidity and residual.sugar (r = 0.00).
ggplot(wineQualityReds, aes(x=fixed.acidity, y=chlorides)) + geom_point(alpha = 1/2)+
ylim(0.0,0.2)
There appears to be a negligible relationship between fixed.acidity and chlorides (r = 0.09).
There appears to be a strong negative relationship between fixed.acidity and pH (r = -0.68).
ggplot(wineQualityReds, aes(x=sulphates, y=pH)) + geom_point(alpha = 1/2)+
xlim(0.0,1.4)
There appears to be a weak negative relationship between sulphates and pH (r = -0.20).
There appears to be a weak to moderate positive relationship between density and citric.acid (r = 0.36).
Only one correlation rounds to exactly zero: the one between volatile.acidity and residual.sugar.
In general, the positive relationships with quality are negligible to weak; alcohol has the strongest positive relationship and residual.sugar has the weakest.
In general, the negative relationships with quality are negligible to moderate; volatile.acidity has the strongest negative relationship and free.sulfur.dioxide has the weakest.
No, I didn’t.
alcohol has the strongest positive relationship with quality (r = 0.48).
The pairs of variables with the strongest positive relationships (r = 0.67 each) are the following:
1- fixed.acidity and citric.acid
2- fixed.acidity and density
3- free.sulfur.dioxide and total.sulfur.dioxide
The strongest negative relationship is between fixed.acidity and pH (r = -0.68).
The plot above shows the relationship between three variables: alcohol, pH and quality. Higher levels of alcohol are associated with higher quality, while higher pH is associated with slightly lower quality (the correlation between pH and quality is only -0.06). So the wine tends to be better when alcohol increases and pH decreases.
ggplot(data = wineQualityReds, aes(alcohol, sulphates, color = as.factor(quality))) +
geom_point()+
ylim(0.0,1.4)+theme_dark()
The plot above shows the relationship between three variables: sulphates, alcohol and quality. Higher levels of sulphates are associated with higher quality, and higher levels of alcohol are also associated with higher quality. So the wine tends to be better when both alcohol and sulphates increase.
The main observation from the multivariate plots above is that better wines have higher levels of alcohol and sulphates but lower pH.
I studied the relationships between only a few variables and did not notice any surprising interactions between them.
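One way to put numbers behind this reading of the plots is to summarise alcohol, sulphates and pH within each quality level. The sketch below shows the technique with `aggregate()` on a hypothetical miniature data frame, `wine_head`, built from the first ten rows of the `str()` output earlier. Ten rows are far too few to confirm or refute the pattern, so the result only illustrates the shape of the table; on the full data the same call would use `wineQualityReds`.

```r
# Hypothetical miniature of the dataset: first ten rows of four columns,
# copied from the str() output. The real data frame has 1599 rows.
wine_head <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  pH        = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median of each chemical property within each quality level:
# one row per quality score, one column per property.
by_quality <- aggregate(cbind(alcohol, sulphates, pH) ~ quality,
                        data = wine_head, FUN = median)
print(by_quality)
```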
This section displays three plots, each with its own description. I chose the first plot from each of the three sections: the univariate plots section, the bivariate plots section and the multivariate plots section.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed.acidity is right-skewed. The minimum value is 4.60 and the maximum is 15.90, so the range of the data is 11.3. The main peak is at approximately 7. There is a small gap between 15 and 15.5. The middle 50% of observations fall between 7.10 and 9.20.
There appears to be a fairly strong positive relationship between fixed.acidity and citric.acid (r = 0.67), with a few outliers.
The plot above shows the relationship between three variables: alcohol, pH and quality. Higher levels of alcohol are associated with higher quality, while higher pH is associated with slightly lower quality (the correlation between pH and quality is only -0.06). So the wine tends to be better when alcohol increases and pH decreases.
The dataset I worked with in this project contains 1599 observations of 12 main variables. It is already tidy, so no cleaning was needed at the beginning of the project.
The project was very interesting. The easiest but longest part was plotting a histogram and summarizing each feature. The main difficulty I faced was the multivariate analysis, because it was new to me and I had not done it before when working in Python.
I studied the relationships between many variables and noticed that there is no relationship between volatile.acidity and residual.sugar. All variables have a negligible to weak positive relationship with quality, except alcohol, which has the strongest positive relationship with quality. residual.sugar has the weakest positive relationship with quality. volatile.acidity has the strongest negative relationship with quality, but in general the negative relationships with quality are negligible to moderate. free.sulfur.dioxide has the weakest negative relationship with quality.
In the future, I would like to invest more time in studying the relationships between the variables I did not explore in this project and in making more multivariate plots.